ABSTRACT

This paper considers, from a theoretical point of
view, two approaches to measuring success and
failure in skills tests in physical education. The first, **fixed
length** (FL) testing, entails counting the number of successful
performances in a fixed number of trials. The second,
**trials-to-criterion** (TTC) testing, involves counting the number of
trials required to achieve a specified number of successes. TTC
measurement results in high measurement error variance for
individuals with low probabilities of success on a single trial.
Error variance declines as the probability rises. If there are many
more people with low probabilities than there are with high
probabilities, which is the case for a positively skewed
distribution, the TTC approach will result in less reliable
measurement than will the FL approach. Under the latter, error
variance is largest for people with a probability of .5. Individuals
lower and higher will have smaller error variances. Two
generalizations based on these results can be made with regard to
skills testing: (1) If the skills test task is one on which most
untrained individuals perform poorly, FL testing would be the better
choice; and (2) If the test scores tend to be negatively skewed, then
TTC testing would be more efficient and reliable for the same total
testing time. Two formulas are presented for estimating the
reliability of TTC measures. (JM)
Whenever a skills test involves an action that may be classified
unambiguously as successful or unsuccessful, such as shooting a free
throw in basketball, the tester has a choice between two measurement
approaches. The first entails counting the number of successful
performances in a fixed number of trials. The second involves counting
the number of trials required to achieve a specified number of successes.
The first of these is by far the more common, but there are situations
in which the second may have clear advantages. In this paper the first
approach is called "fixed length" or FL testing. On such a test the
higher the score is, the better the performance. The second approach
will be referred to as "trials-to-criterion" or TTC testing. On this
type of measurement the lower the score is, the better the performance.
The purpose of this paper is to consider, from a theoretical point of
view, how these approaches compare in reliability.

In order to lay the foundation for the comparison, it will be
necessary to present some theoretical results for the two types of
measurement. Under either type, each examinee is assumed to have a
personal probability of succeeding on any trial. In the literature,
this unknown parameter of subject i is symbolized by $\phi_i$, and it is
assumed that during testing $\phi_i$ remains constant. Under fixed length
testing, with k trials for everyone, person i's true score equals
$k\phi_i$. Under trials-to-criterion testing, with R successes required
for the test to end, the true score of person i equals $R/\phi_i$. True
score is defined here as the long-run average, or expected value,
of the person's observed score if the individual could be measured
many times in the one way or the other. The variance of observed
scores for person i under repeated measurement, a concept commonly
called measurement error variance, will be $k\phi_i(1-\phi_i)$ under fixed
length measurement and $R(1-\phi_i)/\phi_i^2$ under trials-to-criterion measurement.

These expressions for true score and error variance are deducible
from well-known statistical distributions: the binomial distribution
and the negative binomial or Pascal distribution. Their application 
to measurement situations is quite direct and the theories are well 
established. 
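
As an informal check on these expressions, the sketch below simulates both kinds of test for a single examinee and compares the simulated mean and variance of the observed score with the binomial and negative binomial results. The values phi = .5, k = 11, R = 5, the seed, and the number of replications are illustrative assumptions, and the function names are hypothetical.

```python
import random
from statistics import mean, pvariance


def simulate_fl(k, phi, rng):
    """FL score: number of successes in k independent trials (binomial)."""
    return sum(rng.random() < phi for _ in range(k))


def simulate_ttc(R, phi, rng):
    """TTC score: number of trials needed to reach R successes (negative binomial)."""
    trials = successes = 0
    while successes < R:
        trials += 1
        if rng.random() < phi:
            successes += 1
    return trials


rng = random.Random(0)            # fixed seed so the run is repeatable
phi, k, R, n = 0.5, 11, 5, 20000  # illustrative values, not taken from the paper

fl_scores = [simulate_fl(k, phi, rng) for _ in range(n)]
ttc_scores = [simulate_ttc(R, phi, rng) for _ in range(n)]

# Theoretical (true score, error variance) pairs from the expressions in the text.
fl_theory = (k * phi, k * phi * (1 - phi))
ttc_theory = (R / phi, R * (1 - phi) / phi ** 2)

print("FL  simulated:", mean(fl_scores), pvariance(fl_scores), "theory:", fl_theory)
print("TTC simulated:", mean(ttc_scores), pvariance(ttc_scores), "theory:", ttc_theory)
```

With a large number of replications the simulated values should fall close to the theoretical ones, although they will not match exactly.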

In order to compare the reliabilities of FL and TTC measurement, 
some ground rules must be adopted which render FL and TTC tests 
comparable in length. Any test A can be shown to be potentially more 
reliable than another test B if test A can be made much longer than 
test B. This demand for equity in length is made somewhat complicated 
by the fact that TTC measurement doesn't have a fixed stopping point. 
The number of trials is certain to vary from one examinee to another. 
However, if one postulates one or another population distribution
of $\phi$ values, as we shall do later in this paper, one can use the
theory to deduce the population average of the number of trials
needed per examinee. In our comparison of the two types of measurement
we took k, the number of trials for the FL measurement, equal to the
theory-deduced average number under TTC measurement. This seemed
a reasonable basis for comparison. It also turns out to have an
unanticipated virtue. Under this definition of comparable
length, the value of the criterion (R) for TTC measurement does not 
influence the decision as to which form of testing is more reliable. 
A value of R equal to 5 will lead to the same conclusion as R equal 
to 10 or any other required number of successes. 
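
The comparable-length rule can be illustrated with a small calculation. The sketch below assumes a hypothetical, crudely bell-shaped population on the values .2 to .8 (the weights are an assumption, not one of the paper's six distributions) and computes the population average of R/phi, which is then rounded to give k.

```python
# Hypothetical population: values of phi and relative frequencies of examinees.
phi_values = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
weights = [1, 2, 4, 6, 4, 2, 1]  # crude bell shape; an assumption for illustration


def expected_ttc_trials(R, phi_values, weights):
    """Population average of R / phi, the expected number of trials under TTC."""
    total = sum(weights)
    return sum(w * R / p for p, w in zip(phi_values, weights)) / total


R = 5
avg_trials = expected_ttc_trials(R, phi_values, weights)
k = round(avg_trials)  # fixed length judged comparable to the TTC test

print(f"average TTC trials = {avg_trials:.2f}, comparable k = {k}")
```

For this assumed population the average is a little over 11 trials, so an FL test of 11 trials would be treated as comparable in length.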
The final bit of background theory that is needed is the variance
definition of reliability, that is,

$\rho_{rel} = \sigma_T^2 / \sigma_X^2 = \sigma_T^2 / (\sigma_T^2 + \sigma_E^2)$.

Reliability equals the ratio of true score variance to observed score
variance, and observed score variance equals the sum of true score
variance plus error variance. Under both types of measurement, error
variance is not the same for all examinees. It varies from person to
person, depending upon $\phi_i$. When this is the case, the variance of
observed scores equals true score variance plus the average error score
variance, being averaged over the entire population of examinees.
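
A minimal sketch of this definition follows, assuming a discrete population in which each value of phi carries a relative frequency. The function name, the weights, and the choice k = 11 are illustrative assumptions. The function takes each examinee type's true score and error variance, averages the error variances over the population, and returns the ratio of true score variance to observed score variance.

```python
def reliability(true_scores, error_variances, weights):
    """True score variance / (true score variance + mean error variance)."""
    total = sum(weights)
    probs = [w / total for w in weights]
    mean_true = sum(p * t for p, t in zip(probs, true_scores))
    true_var = sum(p * (t - mean_true) ** 2 for p, t in zip(probs, true_scores))
    mean_error_var = sum(p * e for p, e in zip(probs, error_variances))
    observed_var = true_var + mean_error_var  # error variance averaged over examinees
    return true_var / observed_var


# Hypothetical bell-shaped population and an FL test of k = 11 trials.
phi_values = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
weights = [1, 2, 4, 6, 4, 2, 1]
k = 11

fl_true = [k * p for p in phi_values]             # k * phi
fl_error = [k * p * (1 - p) for p in phi_values]  # k * phi * (1 - phi)
fl_rel = reliability(fl_true, fl_error, weights)

print(round(fl_rel, 3))
```

The same function applies to TTC measurement if the true scores R/phi and error variances R(1-phi)/phi^2 are supplied instead.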

With this foundation, it is possible to get on with the comparisons.
No one ever knows how examinees distribute themselves with respect to $\phi$.
Therefore, six different possibilities were considered. In each
hypothetical case $\phi$ ranges from .2 to .8. Some examinees are postulated
to be rather inept, some are assumed very proficient, and most fall
somewhere in between. Figure 1 shows these distributions in graphical
form. It can be seen that the distributions include a crude normal
distribution, two degrees of both positive and negative skewness, and a
symmetrical distribution that is rather flat (platykurtic). For
purposes of TTC measurement a value of 5 was adopted for R. However, the
adoption of five successes was not material. The decision regarding
which type of measurement would be more reliable in each case would
have been the same regardless of the value chosen for R.
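
The irrelevance of R to the decision can be seen from a short derivation. What follows is a sketch in notation assumed here rather than taken from the paper: E and Var denote the mean and variance over the population of $\phi$ values, and k is set exactly equal to $R\,E(1/\phi)$, without rounding.

```latex
\begin{align*}
\rho_{TTC} &= \frac{R^{2}\,\mathrm{Var}(1/\phi)}
                   {R^{2}\,\mathrm{Var}(1/\phi) + R\,E\!\left[(1-\phi)/\phi^{2}\right]}
            = \frac{R\,S_{TTC}}{R\,S_{TTC} + 1},
 & S_{TTC} &= \frac{\mathrm{Var}(1/\phi)}{E\!\left[(1-\phi)/\phi^{2}\right]}, \\[4pt]
\rho_{FL}  &= \frac{k^{2}\,\mathrm{Var}(\phi)}
                   {k^{2}\,\mathrm{Var}(\phi) + k\,E\!\left[\phi(1-\phi)\right]}
            = \frac{R\,S_{FL}}{R\,S_{FL} + 1},
 & S_{FL}  &= \frac{E(1/\phi)\,\mathrm{Var}(\phi)}{E\!\left[\phi(1-\phi)\right]}.
\end{align*}
```

Because $x/(x+1)$ increases with $x$, TTC is the more reliable approach exactly when $S_{TTC} > S_{FL}$, and neither quantity involves R. If k is rounded to a whole number, the cancellation is only approximate.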

Table 1 summarizes the crucial statistics for the two types of
measurement under each postulated distribution. To illustrate the
meaning of the values: under a normal distribution of $\phi$, TTC
measurement would result in an average of slightly more than 11 trials
per subject. This is the meaning of the mean true score under TTC.
Consistent with this value, the value of k for FL testing
was taken as 11. The variance of true scores under TTC was 16.229; under
FL measurement true score variance equaled 2.468. Observed score variances
were about 33 and 5, respectively. In a numerical sense, subjects spread
out much more under TTC measurement than under FL measurement, in which
everyone is allotted exactly 11 trials. But these quantities are not
the primary facts of interest here.
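
The reliabilities implied by these figures can be recovered with simple arithmetic. The sketch below uses the true score variances quoted above together with the rounded observed score variances ("about 33 and 5"), so the results are approximations rather than the exact entries of Table 1.

```python
def reliability(true_var, observed_var):
    """Ratio of true score variance to observed score variance."""
    return true_var / observed_var


# True score variances are quoted in the text; observed variances are rounded there.
ttc_rel = reliability(16.229, 33.0)
fl_rel = reliability(2.468, 5.0)

print(f"TTC reliability is roughly {ttc_rel:.2f}")
print(f"FL reliability is roughly {fl_rel:.2f}")
```

Both ratios come out near one half, in keeping with the finding that the two approaches are about equally reliable for the normal case.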

The most important facts are the reliability coefficients in the
last row of the upper and lower halves of the table. These values
indicate which type of measurement is superior, in terms of reliability,
for each population. As one may see, in some cases the advantage lies
with TTC testing and in other cases with FL testing. The trends may 
be summarized as follows: 

1) When the values of $\phi$ are close to being normally distributed around
   $\phi = .5$, the approaches are about equal in reliability. (Normal
   distributions centered around $\phi = .6$ and $\phi = .7$ gave practically
   the same results.)

2) When the bulk of the $\phi$ distribution is below $\phi = .5$, and only a light
   tail extends upward toward $\phi = .8$, FL is the better approach. The
   stronger the degree of positive skewness, the more marked the FL
   superiority.

3) When the bulk of the $\phi$ distribution is above $\phi = .5$, and only a long
   tail extends downward toward $\phi = .2$, TTC is the better approach.
   The stronger the degree of negative skewness, the more marked is the
   TTC superiority.

4) Platykurtosis, accompanied by symmetry, results in an advantage for
   FL measurement. Heaviness in the upper range of $\phi$ values does not
   compensate for similar heaviness in the lower range of $\phi$ values. Thus,
   as the symmetry of the normal distribution shifts toward the
   symmetry of a platykurtic distribution, the equal reliability
   situation changes to an advantage for FL measurement.
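
These trends can be reproduced in outline with made-up populations. The sketch below is not a reconstruction of the paper's six distributions: the four sets of weights are hypothetical shapes on the values .2 to .8, R = 5 is used for TTC, and k is set to the unrounded population average of R/phi. The exact reliabilities therefore differ from Table 1, although the direction of each comparison can be checked.

```python
PHI = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

# Hypothetical shapes: relative frequencies at each value of phi.
SHAPES = {
    "near-normal":   [1, 2, 4, 6, 4, 2, 1],
    "positive skew": [4, 6, 4, 2, 2, 1, 1],
    "negative skew": [1, 1, 2, 2, 4, 6, 4],
    "flat":          [1, 1, 1, 1, 1, 1, 1],
}


def expect(values, weights):
    """Weighted population mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def variance(values, weights):
    """Weighted population variance."""
    m = expect(values, weights)
    return expect([(v - m) ** 2 for v in values], weights)


def reliabilities(weights, R=5):
    """Return (FL reliability, TTC reliability) for tests of comparable length."""
    k = R * expect([1 / p for p in PHI], weights)  # comparable length, unrounded
    fl_true = variance([k * p for p in PHI], weights)
    fl_err = expect([k * p * (1 - p) for p in PHI], weights)
    ttc_true = variance([R / p for p in PHI], weights)
    ttc_err = expect([R * (1 - p) / p ** 2 for p in PHI], weights)
    return fl_true / (fl_true + fl_err), ttc_true / (ttc_true + ttc_err)


results = {name: reliabilities(w) for name, w in SHAPES.items()}
for name, (fl, ttc) in results.items():
    print(f"{name:14s} FL={fl:.3f}  TTC={ttc:.3f}")
```

For these assumed shapes the two approaches are nearly tied in the near-normal case, FL is ahead for the positively skewed and flat populations, and TTC is ahead for the negatively skewed population.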

To summarize the trends briefly, TTC measurement results in high
measurement error variance for individuals with low probabilities of
success on a single trial. Error variance declines as the probability
rises. If there are many more people with low probabilities than there
are with high probabilities, which is the case for a positively skewed
distribution, the TTC approach will result in less reliable measurement
than the FL approach. Under the latter, error variance is largest for
people with a probability of .5. Individuals lower and higher will
have smaller error variances.
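
The pattern of error variances can be tabulated directly from the two expressions. The sketch below assumes k = 11 and R = 5 purely for illustration. Note that the two columns are on different score scales (successes for FL, trials for TTC), so it is the shape of each column, not the comparison across columns, that matters.

```python
def fl_error_variance(k, phi):
    """Binomial error variance: k * phi * (1 - phi)."""
    return k * phi * (1 - phi)


def ttc_error_variance(R, phi):
    """Negative binomial error variance: R * (1 - phi) / phi**2."""
    return R * (1 - phi) / phi ** 2


k, R = 11, 5  # illustrative test lengths
phis = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

fl = [fl_error_variance(k, p) for p in phis]
ttc = [ttc_error_variance(R, p) for p in phis]

for p, f, t in zip(phis, fl, ttc):
    print(f"phi={p:.1f}  FL error variance={f:5.2f}  TTC error variance={t:7.2f}")
```

The TTC column falls steadily as phi rises, while the FL column peaks at phi = .5 and is smaller on either side.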

What implications do these results have for skills testing? A
few tentative generalizations may be offered. If the skills test task
is one on which most untrained individuals perform poorly (say, $\phi < .5$),
FL testing would be the better choice. Such might be the case with
pre-instruction tests, placement tests, or any measurement likely to yield
scores that are positively skewed. If the test scores tend to be
negatively skewed, then TTC testing would be more efficient and
reliable for the same total testing time. This is more likely to be true
of post-instruction scores than pre-instruction scores, although it
could be true of both. Symmetrical distributions, particularly those
that are "flatter-than-normal," call for the use of the FL approach.

These recommendations are predicated on the use of a value of k
reasonably close to the expected value of $R/\phi$. If TTC is not used,
there is no reason to specify R, nor would the examiner ever know the
expected value of $R/\phi$. However, if the choices of k and R were made
equitably, the foregoing recommendations would apply. The
recommendations are valid in the sense of getting the highest reliability out of
the total number of trials by all examinees.

To conclude this paper, two formulas are presented for estimating
the reliability of TTC measures. These formulas have been derived by
Dr. Judy Spray and her students. The first formula bears a striking
resemblance to the familiar KR-21. The second is the general form of
Cronbach's coefficient alpha. For this application of coefficient
alpha, the TTC test is perceived as having R parts. The first part
ends with the first successful trial, the second part ends with the
second successful trial, and so on. The score of subject i on part j
is the number of additional trials required by the subject to achieve
the jth success after achieving the (j-1)st success. These formulas
can be shown to be algebraically identical when population parameters
are substituted in each. They are not necessarily equal when sample
statistics are used. Investigations are underway to compare these
formulas with respect to bias and sampling error.
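
The coefficient alpha approach can be sketched on simulated data. The code below applies the standard general form of coefficient alpha to a TTC test split into R parts as described above. It is an illustration only and not necessarily the exact expression on the authors' formula sheet. The population of phi values, the sample size, the seed, and the function names are assumptions.

```python
import random
from statistics import variance


def coefficient_alpha(parts):
    """General form of coefficient alpha; parts[i][j] = score of examinee i on part j."""
    n_parts = len(parts[0])
    totals = [sum(row) for row in parts]
    part_vars = [variance([row[j] for row in parts]) for j in range(n_parts)]
    return n_parts / (n_parts - 1) * (1 - sum(part_vars) / variance(totals))


def simulate_ttc_parts(R, phi, rng):
    """Trials needed for each successive success: the part scores Y_1, ..., Y_R."""
    parts = []
    for _ in range(R):
        wait = 1
        while rng.random() >= phi:  # keep trying until a success occurs
            wait += 1
        parts.append(wait)
    return parts


rng = random.Random(1)  # fixed seed so the run is repeatable
R = 5
# Hypothetical flat population of examinees on phi = .2, ..., .8.
phis = [rng.choice([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]) for _ in range(2000)]
data = [simulate_ttc_parts(R, p, rng) for p in phis]

alpha = coefficient_alpha(data)
print(f"estimated reliability (alpha) = {alpha:.3f}")
```

Each examinee's TTC score is the sum of the R part scores, so the total score variance in the formula is simply the variance of the usual trials-to-criterion scores.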
Figure 1. Six Population Distributions of the Probability of Success
for a Single Trial on a Hypothetical Skills Test
Table 1

Mean True Score, True Score Variance, Error Score
Variance, Observed Score Variance and Reliability
for Six Populations on a Hypothetical Skills Test
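Formula Sheet for FL and TTC Measurement

(E and Var denote the mean and variance over the population of $\phi$ values.)

1. True score for person i
   FL:  $k\phi_i$
   TTC: $R/\phi_i$

2. Population mean true score
   FL:  $k\,E(\phi)$
   TTC: $R\,E(1/\phi)$

3. True score variance
   FL:  $k^2\,\mathrm{Var}(\phi)$
   TTC: $R^2\,\mathrm{Var}(1/\phi)$

4. Error score variance of person i
   FL:  $k\phi_i(1-\phi_i)$
   TTC: $R(1-\phi_i)/\phi_i^2$

5. Population mean error variance
   FL:  $k\,E[\phi(1-\phi)]$
   TTC: $R\,E[(1-\phi)/\phi^2]$

6. Observed score variance
   FL and TTC: true score variance (3) plus population mean error variance (5)

7. Reliability (theoretical)
   FL and TTC: true score variance (3) divided by observed score variance (6)

8. Reliability estimation formulas (TTC)
   [formulas illegible in source]
   $Y_j$ = number of trials needed to achieve success j after
   achieving success (j-1)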